For this part, you'll be using PySpark to process tweets, i.e., messages posted on Twitter.
This Jupyter notebook contains: (i) instructions to set up the programming environment, and (ii) the programming problems. You will use this notebook to write your code and submit it for grading.
To run your code, you will need to connect to a Jupyter server maintained by CSC. Please read the following instructions very carefully.
The server will restart after 15 minutes of inactivity.
Therefore, if you have made changes to the notebook but want to pause your work, make sure you download the notebook to your computer. You can upload it again when you're ready to resume your work.
What to do if the server restarts while inactive: to save the code you wrote, copy and paste it manually into a text file on your computer. You can then follow the main steps described above again.
You will work with two datasets:
When working with the big dataset, you will use a bigger cluster than the one used for the small dataset. To see how many other students are using it and whether resources are available, visit this webpage: https://86.50.170.119/resources.
If you receive a message that says that the server is full, allow 15-20 minutes before you try to access it again.
Attention! You must develop and run your code before the day of the deadline. We cannot guarantee support for any failures that happen on that day.
Run the following cells once before you start working on your solutions.
Note: If you attempt to create a SparkContext twice, you will get an error.
In [ ]:
## Imports and SparkContext
import pyspark
import numpy as np # use numpy for advanced numeric operations
import ast # we'll use this module to read data
In [ ]:
## SMALL dataset
## Run this cell *ONLY* if you want to be
## using the SMALL dataset
SMALL_FILE = "file:///data/small.input"
DATA_FILE = SMALL_FILE
In [ ]:
## BIG dataset
## Run this cell *ONLY* if you want to be
## using the BIG dataset
BIG_FILE = "hdfs:///moderndb/input_2.5m.input"
## Environment parameters for the big cluster
## Use them ONLY if you're working with the big dataset
import os
os.environ["PYSPARK_PYTHON"]="/opt/conda/bin/python3"
os.environ["SPARK_HOME"]="/usr/hdp/current/spark-client"
os.environ["HDP_VERSION"]="current"
DATA_FILE = BIG_FILE
In [ ]:
sc = pyspark.SparkContext()
load_func = sc.textFile
In [ ]:
data = load_func(DATA_FILE)
Most lines in the file contain a string representation of a Python dictionary that represents a tweet. A tweet is a message posted on Twitter. Below is an example of one tweet (one line in the file).
{u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 08:01:38 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [126, 140], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': [{u'id': 2190056023, u'id_str': u'2190056023', u'indices': [3, 9], u'name': u'BBC Outside Source', u'screen_name': u'BBCOS'}]}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': None, u'id': 515410856616935425, u'id_str': u'515410856616935425', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 0, u'retweeted': False, u'retweeted_status': {u'contributors': None, u'coordinates': None, u'created_at': u'Fri Sep 26 07:44:43 +0000 2014', u'entities': {u'hashtags': [], u'symbols': [], u'trends': [], u'urls': [{u'display_url': u'bbc.in/1qBcZ4G', u'expanded_url': u'http://bbc.in/1qBcZ4G', u'indices': [115, 137], u'url': u'http://t.co/8kPs90syqo'}], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'low', u'geo': None, u'id': 515406597829693440, u'id_str': u'515406597829693440', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': None, u'possibly_sensitive': False, u'retweet_count': 3, u'retweeted': False, u'source': u'<a href="http://www.socialflow.com" rel="nofollow">SocialFlow</a>', u'text': u"EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8kPs90syqo", u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Tue Nov 12 10:17:10 +0000 2013', u'default_profile': 
False, u'default_profile_image': False, u'description': u'Real-time reports from inside the BBC newsroom with @BBCRosAtkins. Combining your sources and ours. BBC World Service radio 10GMT, BBC World News TV 17GMT.', u'favourites_count': 734, u'follow_request_sent': None, u'followers_count': 16500, u'following': None, u'friends_count': 750, u'geo_enabled': False, u'id': 2190056023, u'id_str': u'2190056023', u'is_translator': False, u'lang': u'en-gb', u'listed_count': 340, u'location': u'London', u'name': u'BBC Outside Source', u'notifications': None, u'profile_background_color': u'131516', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme14/bg.gif', u'profile_background_tile': True, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2190056023/1399384694', u'profile_image_url': u'http://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/421268342767222784/17yQM0_d_normal.jpeg', u'profile_link_color': u'009999', u'profile_sidebar_border_color': u'EEEEEE', u'profile_sidebar_fill_color': u'EFEFEF', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'BBCOS', u'statuses_count': 6602, u'time_zone': u'London', u'url': u'http://www.bbc.co.uk/programmes/p01k2bx3', u'utc_offset': 3600, u'verified': True}}, u'source': u'<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>', u'text': u"RT @BBCOS: EU's anti-terrorism chief tells BBC the number of Europeans joining Islamist fighters in Syria and Iraq now 3,000+ http://t.co/8\u2026", u'timestamp_ms': u'1411718498766', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Sun Nov 13 20:46:00 +0000 2011', u'default_profile': True, u'default_profile_image': False, u'description': u'The more I do nothing, 
the less time I have to do anything', u'favourites_count': 27, u'follow_request_sent': None, u'followers_count': 216, u'following': None, u'friends_count': 747, u'geo_enabled': True, u'id': 411752020, u'id_str': u'411752020', u'is_translator': False, u'lang': u'en', u'listed_count': 1, u'location': u'', u'name': u'Gerald Quinlan', u'notifications': None, u'profile_background_color': u'C0DEED', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'profile_background_tile': False, u'profile_image_url': u'http://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/1661078970/image_normal.jpg', u'profile_link_color': u'0084B4', u'profile_sidebar_border_color': u'C0DEED', u'profile_sidebar_fill_color': u'DDEEF6', u'profile_text_color': u'333333', u'profile_use_background_image': True, u'protected': False, u'screen_name': u'QuinlanQuinlan', u'statuses_count': 9223, u'time_zone': None, u'url': None, u'utc_offset': None, u'verified': False}}
Note that some of the lines might be corrupt -- i.e., they do not contain a tweet, but other information, e.g., a logging message. It is your responsibility to deal with corrupt lines and make sure they do not affect your computation.
If tweet_string is the string representation of one tweet, you can load it into a Python dictionary with the following statement.
tweet = ast.literal_eval(tweet_string)
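Since some lines are corrupt (see the note below), you may want to wrap this call in error handling. A minimal sketch, using a hypothetical helper named `parse_tweet` that treats any line that is not a valid dictionary literal as corrupt:

```python
import ast

def parse_tweet(line):
    """Parse one line into a tweet dictionary.

    Returns None for corrupt lines, i.e., lines that are not a valid
    string representation of a Python dictionary (e.g., logging messages).
    """
    try:
        tweet = ast.literal_eval(line)
        # a valid line must evaluate to a dictionary, not some other literal
        return tweet if isinstance(tweet, dict) else None
    except (ValueError, SyntaxError):
        return None
```

One way to use it in a pipeline is, e.g., `data.map(parse_tweet).filter(lambda t: t is not None)`, which yields an RDD of tweet dictionaries with corrupt lines dropped.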
Use PySpark to write and execute queries for the following tasks (1 - 10). Generate one cell per task.
You are free to reuse (possibly persisted) RDDs and other code from earlier tasks.
Tasks
Find out how many lines (tweets or corrupt) there are in the data.
Inspect any three tweets and infer the tweet schema from them, as completely as you can.
Each tweet is identified by its unique id value. Some tweets might appear in more than one line in the data (i.e., the same tweet id might appear on more than one line). How many tweets appear in only one line?
Each tweet is associated with a lang value that denotes the language the tweet was written in. How many different languages are there in the dataset?
How many lines are there in the data for each language? (Ignore tweet ids for this task).
How many tweets are there in the data for the English language (lang = 'en')?
Each tweet is associated with a text value that stores the content of the message. What are the minimum and maximum message lengths in the data? Note: you should compute both values in a single pass over the data.
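To illustrate the single-pass idea (not a task solution): both values can be tracked in one accumulator of the form (min, max). The combiner logic is shown below in plain Python; with PySpark, the same two functions could be passed to, e.g., RDD.aggregate.

```python
# Neutral element for the (min, max) accumulator: any real length
# is smaller than +inf and larger than -inf.
INITIAL = (float('inf'), float('-inf'))

def seq_op(acc, length):
    # fold one message length into the running (min, max)
    return (min(acc[0], length), max(acc[1], length))

def comb_op(a, b):
    # merge two partial (min, max) results, e.g., from different partitions
    return (min(a[0], b[0]), max(a[1], b[1]))
```

With an RDD of message lengths, a call shaped like `lengths.aggregate(INITIAL, seq_op, comb_op)` would return both extremes in a single pass.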
Consider the text message that appears in a tweet. We define a word to be a maximal sequence of alphanumeric characters found in the text, after the text has been converted to lowercase. For example, if the text is
My username is spark123 & my password is spar!.Kk
then the words contained in it are the following.
my, username, is, spark123, my, password, is, spar, kk
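This word definition can be sketched as a small helper using a regular expression over the lowercased text; the pattern `[a-z0-9]+` assumes only ASCII alphanumerics count, which matches the example above.

```python
import re

def words(text):
    """Return the words of a message: maximal runs of alphanumeric
    characters in the lowercased text."""
    return re.findall(r'[a-z0-9]+', text.lower())
```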
Find the 1000 most frequent words in the text messages of all English tweets (lang = 'en'), along with the number of their occurrences.
Find the 100 most frequent words in English tweets (lang = 'en') that start with each character of the Latin alphabet ('a', 'b', ..., 'z'). Use the lowercase version of the messages.
Each tweet is associated with a user who generated it, and the user is associated with a unique screen_name. Find the screen_name of the $10$ users with the most English tweets (with distinct tweet ids) in the data.
Note: Twitter users are also associated with an id value, different from the tweet id. Make sure you do not confuse the two.
In [ ]:
## 1
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 2
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 3
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 4
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 5
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 6
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 7
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 8
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 9
# Your code goes here
Use this cell to explain / expand your answer.
In [ ]:
## 10
# Your code goes here
Use this cell to explain / expand your answer.
Some tweets are replies to tweets written by other Twitter users. You can tell which tweets are replies by checking the in_reply_to_screen_name value of a tweet: if it is None, the tweet is not a reply; otherwise, it is a reply to the user with the screen_name mentioned therein.
For example, in_reply_to_screen_name: None signifies that the tweet is not a reply to another tweet, while in_reply_to_screen_name: northern_bytes signifies that the tweet is a reply to a tweet generated by user northern_bytes.
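As an illustration of this check (a hypothetical helper, assuming an edge connects the author of a reply to the user being replied to):

```python
def reply_edge(tweet):
    """Return a (replier, replied_to) pair of screen_names for a reply
    tweet, or None if the tweet is not a reply."""
    target = tweet.get('in_reply_to_screen_name')
    if target is None:
        return None
    return (tweet['user']['screen_name'], target)
```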
We are going to construct a graph in the following steps:
Construct a pair RDD named linksRDD to store the adjacency list of each node. Specifically, for each node (user) $u$ in the graph, you should have one element of the following form:
(screen_name, [screen_name_1, screen_name_2, ..., screen_name_v, ...])
where screen_name corresponds to user $u$ and screen_name_v corresponds to a user $v$ that $u$ is connected to.
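The grouping step has the same shape as collecting edge pairs into per-node lists; shown below in plain Python for illustration (with PySpark, an RDD of (u, v) pairs can be grouped with, e.g., `edges.groupByKey().mapValues(list)`).

```python
def adjacency_lists(edges):
    """Group (u, v) edge pairs into a mapping from each node u
    to the list of nodes v it is connected to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj
```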
In [ ]:
# Your code goes here
How many nodes and edges are there in the graph?
In [ ]:
# Your code goes here
In [ ]:
## Pagerank implementation
# Your code goes here
Report the $5$ nodes with the highest PageRank values.
In [ ]:
# Your code goes here
We are grateful to:
If the file is updated (e.g., to add a clarification), a short description of the update will be listed here, as well as on mycourses.
In [ ]: